Oh Rats!:
A Data Science Dive Into Invasive Species
and Island Sustainability
Here we will import many of the industry-standard libraries including NumPy, Pandas, MatPlotLib, and Plotly. Feel free to find these packages on the Python Package Index!
For your reference, we'll break down one line of the large cell below. In this line:
import pandas as pd
We import the pandas library. Essentially, we're telling python that we want to use a specific set of functions that has been named "pandas". We then use the as keyword to specify that we want to use some other name to refer to "pandas"; in this context we told python we want to call it "pd".
Lets say we want to use the pandas.notnull() function (this checks if any values in an array or list are None).
Because we imported "pandas" as "pd", we could now call pd.notnull()
Note: We name pandas as "pd" because it is the standard name in the industry, however we could have said:
import pandas as harrystyles
Which would've allowed us to run Harry_Styles.notnull(). This would be confusing for a reader, so its considered best practice to use standard names such as "pd" for "pandas" or "np" for "numpy".
The cell below runs what we just talked about. Feel free to play around and try things out!
import pandas as pd
import pandas as harrystyles
print("Harry Styles is a Panda?", harrystyles == pd)
truth = [1, 2, 3, None, 4, harrystyles,
"Harry Styles", None, None,
5 * 2 - 1 / 1e-20, 1]
print(harrystyles.notnull(truth))
print(pd.notnull(truth))
# these lines delete the two imports we just made
del harrystyles
del pd
Harry Styles is a Panda? True [ True True True False True True True False False True True] [ True True True False True True True False False True True]
Now, lets get to work! Here we'll import the actual modules we'll need for this notebook!
#necessary to implement `download_data()` [see below]
import requests
from pathlib import Path
import time
#standard data analysis libraries
import numpy as np
import pandas as pd
#imports for displaying, rendering, and saving plots and visualizations
import plotly
import plotly.express as px
from IPython.display import *
import plotly.io as pio
import ipywidgets as widgets
from ipywidgets import *
#for dependency if not on datahub
# >>>%pip install openpyxl
download_data¶The download_data function will get data from data_url on the Internet and saves it so we can use it for analysis!
This is a good practice to get into so that your future research or projects are reproducible. Essentially, if you gave someone else a copy of this notebook, they should be able to run all of the cells and get the same result. So, if we have a cell that downloads the data when run, people don't even need access to the data! This is also great because if the data changes or is updated, we don't have to update the notebook, it will do so automatically (woahhh how cool)!!
Please don't worry about understanding this function, it's rather complex, even for those with a moderate amount of data science experience. You can just use this function and trust it works.
def download_data(data_url,
file):
file_path = Path(file)
print('Downloading...', end=' ')
resp = requests.get(data_url)
with file_path.open('wb') as f:
f.write(resp.content)
print('Done!')
return file_path
#treat this cell as majic, just run it
funcs = download_data("https://tinyurl.com/island-func",
"funcs.py")
from funcs import *
Downloading... Done!
show¶The function .show() below will allow us to freely examine this data! You don't need to understand how it works, just read the following properties:
| Code | Result |
|---|---|
your_table.show(True) |
shows the entire DataFrame |
your_table.show(columns = ['x', 'y']) |
shows the first 20 rows of columns x and y |
your_table.show(rows = 18) |
shows the first 18 rows |
your_table.show(rows = -18) |
shows the last 18 rows |
your_table.show(start = 25, rows = 18) |
shows the first 18 rows that come after row 25 |
In the above table, your_table represents whatever table you are working in
Note: This function is only for ease of exploration, if you don't feel comfortable using it, don't worry! The documentation is below if you want to check out the implementation.
.show()¶Here, we're using a small dataset to show what show can do (pun-intended). Don't worry about understanding the data, just look at the resulting tables
airports = download_data("https://raw.githubusercontent.com/vega/vega-datasets/next/data/airports.csv", "airports.csv")
print(airports)
airports = pd.read_csv(airports)
airports.show(rows = 3,
title = ["First 3 rows", ["b"]])
airports.show(rows =3, columns = ['name', 'iata'],
title = ['First 3 rows with columns "name" and "iata" (in that order)', ["b"]])
airports.show(rows = 3, start = 101,
title = ['3 rows starting with 101', ["b"]])
Downloading... Done! airports.csv
First 3 rows
| iata | name | city | state | country | latitude | longitude | |
|---|---|---|---|---|---|---|---|
| 0 | 00M | Thigpen | Bay Springs | MS | USA | 31.953765 | -89.234505 |
| 1 | 00R | Livingston Municipal | Livingston | TX | USA | 30.685861 | -95.017928 |
| 2 | 00V | Meadow Lake | Colorado Springs | CO | USA | 38.945749 | -104.569893 |
3376 Rows x 7 Columns
First 3 rows with columns "name" and "iata" (in that order)
| name | iata | |
|---|---|---|
| 0 | Thigpen | 00M |
| 1 | Livingston Municipal | 00R |
| 2 | Meadow Lake | 00V |
3376 Rows x 7 Columns
3 rows starting with 101
| iata | name | city | state | country | latitude | longitude | |
|---|---|---|---|---|---|---|---|
| 101 | 12C | Rochelle Municipal | Rochelle | IL | USA | 41.893001 | -89.078290 |
| 102 | 12D | Tower Municipal | Tower | MN | USA | 47.818333 | -92.291667 |
| 103 | 12J | Brewton Municipal | Brewton | AL | USA | 31.051263 | -87.067968 |
3376 Rows x 7 Columns
Here, we'll use the function above (download_data), to get the data we want.
The data source we'll be looking at is a Microsoft Excel File, or .xls for short, which some of you may know a little bit about. One advantage that an excel file has over other data storage methods (such as .csv files) is that it can have separate "sheets", which are each are their own table/data frame.
To ensure we can work with this data, we read the data which we call island_data twice, each time with the pd.read_excel function (oh how creative computer scientists are at naming things).
The first time we'll load the 0th page, and then the 1st (in non-data-science-terms the 1st and 2nd sheets), using the sheet_name argument.
We call these seedlings and trees for reasons that will soon become evident.
island_data = download_data(
"https://datadryad.org/stash/downloads/file_stream/55546", "island.xls")
seedlings = pd.read_excel(island_data, sheet_name = 0)
trees = pd.read_excel(island_data, sheet_name = 1)
Downloading... Done!
Now, lets explore this data a bit! Below we'll look at what we just downloaded!
The seedlings DataFrame has 275 rows and 5 columns (as noted at the bottom left of the data)
Each row represents one collection/examination. The columns are:
Seedlings Columns
| Title | Meaning | Type |
|---|---|---|
Transect_ID |
a uniqe ID for each sample | str |
P. grandis Count Seedings |
count of grandis seedings | int |
C.nucifera Count Seedings |
count of nucifera seedings | int |
Year |
The year which the sample was taken | int |
Forest Type |
The type of forest the same was taken in | str |
Note: The numbers on the left side of the table are called the
index, its an easy way for us to keep track of what rows are in what order
seedlings.show(rows = 3)
| Transect_ID | P.grandis Count Seedlings | C.nucifera Count Seedlings | Year | Forest Type | |
|---|---|---|---|---|---|
| 0 | VT-Avi-1-C | 0 | 14 | 2007 | Cocos |
| 1 | VT-Avi-2-C | 0 | 37 | 2007 | Cocos |
| 2 | VT-Avi-2-I | 0 | 12 | 2007 | Cocos |
275 Rows x 5 Columns
Now we want to clean the data up. We won't use the Transect_ID column, so lets drop it.
While we're at it lets shorten some of the column names. Here's a list of what we'll change:
New Seedlings Column Labels
| Old | New |
|---|---|
Transect_ID |
REMOVED |
P. grandis Count Seedings |
Pisona Count |
C.nucifera Count Seedings |
Coco Count |
Year |
Year |
Forest Type |
Forest |
Below, we'll use the .drop method to drop the first column. We'll then set the table's column labels (seedlings.columns) equal to a list of new labels. You can view the result of each operation by running the cell.
if 'Transect_ID' in seedlings.columns:
# this if statement allows you to run this cell
# multiple times without encountering an error
seedlings = seedlings.drop(columns="Transect_ID")
seedlings.show(rows = 3,
title = 'Removed <code>Transect_ID</code>')
seedlings.columns = ['Pisona Count', "Coco Count", "Year", "Forest"]
seedlings.show(rows = 3, title = 'Relabeled Columns')
Removed Transect_ID
| P.grandis Count Seedlings | C.nucifera Count Seedlings | Year | Forest Type | |
|---|---|---|---|---|
| 0 | 0 | 14 | 2007 | Cocos |
| 1 | 0 | 37 | 2007 | Cocos |
| 2 | 0 | 12 | 2007 | Cocos |
275 Rows x 4 Columns
Relabeled Columns
| Pisona Count | Coco Count | Year | Forest | |
|---|---|---|---|---|
| 0 | 0 | 14 | 2007 | Cocos |
| 1 | 0 | 37 | 2007 | Cocos |
| 2 | 0 | 12 | 2007 | Cocos |
275 Rows x 4 Columns
The second table we'll look at is called trees. It has 245 rows and 4 columns (again as shown in the same location). Each row here represents a count of certain trees, per a set area. The columns and their meanings are in the table below
Trees Columns
| Title | Meaning | Type |
|---|---|---|
TreeID |
a uniqe ID for each sample | str |
Year |
The year which the sample was taken | int |
Count Seedings (0 - 2m) |
count of grandis seedings | int |
SPECIES |
The species of tree represented | str |
For our purposes, the column TreeID is irrelevant, so we can drop it in the same fashion as before. Lets also relabel the columns as such:
New Trees Column Labels
| Old | New |
|---|---|
TreeID |
REMOVED |
Year |
Year |
Count Seedings (0 - 2m) |
Count |
SPECIES |
Tree |
if "TreeID" in trees.columns:
# this if statement allows you to run this cell
# multiple times without encountering an error
trees = trees.drop(columns = "TreeID")
trees.show(rows = 3, title = "Removed <code>TreeID</code> Column")
trees.columns = ["Year", "Count", "Tree"]
trees.show(rows = 3, title = "Relabeled Columns")
Removed TreeID Column
| Year | Count Seedlings (0 - 2m) | SPECIES | |
|---|---|---|---|
| 0 | 2004 | 0 | Barr |
| 1 | 2004 | 30 | Barr |
| 2 | 2004 | 100 | Barr |
245 Rows x 3 Columns
Relabeled Columns
| Year | Count | Tree | |
|---|---|---|---|
| 0 | 2004 | 0 | Barr |
| 1 | 2004 | 30 | Barr |
| 2 | 2004 | 100 | Barr |
245 Rows x 3 Columns
Now we've got our cleaned tables! Check them out below
seedlings.show(rows = 3, title = "Seedlings")
trees.show(rows = 3, title = "Trees")
Seedlings
| Pisona Count | Coco Count | Year | Forest | |
|---|---|---|---|---|
| 0 | 0 | 14 | 2007 | Cocos |
| 1 | 0 | 37 | 2007 | Cocos |
| 2 | 0 | 12 | 2007 | Cocos |
275 Rows x 4 Columns
Trees
| Year | Count | Tree | |
|---|---|---|---|
| 0 | 2004 | 0 | Barr |
| 1 | 2004 | 30 | Barr |
| 2 | 2004 | 100 | Barr |
245 Rows x 3 Columns
rats_present¶Hmmm...it looks like we're missing something.
We still don't know when the rats were eradicated; good thing we can figure that out! We know that by the year 2011, rats were eradicated from the island. Lets use that to write a function to add to our tables!
def add_rats(table):
def rats_present(x):
if x > 2010:
return "No"
else:
return "Yes"
if "Rats" in table.columns:
show("Rats already exists!")
else:
table["Rats"] = table["Year"].map(rats_present)
show("Rats added")
return table
for data in [seedlings, trees]:
data = add_rats(data)
Rats added
Rats added
first¶Don't worry about the implementation of this function, but it returns the first value of an iterable (or list-like) object.
def first(series):
return series.iloc[0]
Here are some figures that we got from the original research paper (where we got the data from). Later, you'll replicate them and you can decide who's you like more!
tree_type_table¶This function takes in a DataFrame and a string, and does all of the following, each on one line:
Tree has the value namedef tree_type_table(table: pd.DataFrame, name: str):
tree = (table.query(f"Tree == '{name}'")
.groupby("Year")
.agg({"Count": sum, "Rats":first})
.reset_index())
return tree
triple_seedling_table¶This function is a bit more complicated, so we'll go though it quickly; if you're interested, you're more than welcome to try to figure it out, but it is more advanced than the majority of this notebook.
triple_seedlings_table does the following:
main which selects relevant columnsforest which is main, but only in the year $2016$rats which is main which performs a similar aggregation to tree_type_tabledef triple_seedling_table(table: pd.DataFrame, genus: str):
main = (table[[f'{genus} Count','Year', 'Forest', 'Rats']]
.rename(columns = {f'{genus} Count': "Count"}))
forest = (main.drop(columns="Rats")
.query("Year == 2016"))
rats = (main.drop(columns="Forest")
.groupby("Year")
.agg({"Count": sum, "Rats": first})
.reset_index())
return main, forest, rats
plot_rats¶You don't need to worry about the implementation of this function, but it will give you a nicely formatted graph given a Table/DataFrame. This function should ideally have one column named 'Year', one named 'Count', and one named 'Rats'. If it has these three functions you can simply call the following:
plot_rats(data_frame)
If your data frame doesn't have all of these, you'll need to call the function as so:
plot_rats(data = data_frame, x = x axis, y = y axis, color = group column)
where all of the values right of the = are filled in by you. You may also optionally include the argument:
title = "Your Title"
The implementation of the function can be found below, and if you're interested, you can read the documentation of the pandas and plotly modules which make this function work.
def plot_rats(data, x = None, y = None, color = None, title = None):
cols = pd.Series(['Year', 'Count', 'Rats'])
present = cols.isin(data.columns).all()
if present:
fig = px.bar(data_frame = data, x='Year', y="Count", color = 'Rats',
color_discrete_sequence =['#a63932', '#003262'],
title = title, text_auto = True) #, width = 1000, height = 500)
else:
fig = px.bar(data_frame = data, x=x, y=y, color = color,
color_discrete_sequence =['#a63932', '#003262'],
title = title) #, width = 1000, height = 500)
fig.update_xaxes(type = 'category')
fig.show()
plot_forest¶This function functions in a very similar fashion. It takes in a dataframe, and outputs an interactive bar graph!
def plot_forest(df, title = None):
grouped = df.groupby(["Year", "Forest"]).sum().reset_index().drop(columns="Year")
plot = px.bar(data_frame = grouped, x = "Forest", y = "Count", color = "Forest", text_auto = True,
color_discrete_sequence = ["#823e2b",'#85683c', "#3e853c"], title = title) #, width = 1000, height = 500
plot.show()
The long awaited end has arrived! Theres no more work for you to do in this section. Simply run the cell below to see what all of your hard work has created. Go you!
Play Around a bit with the cell below after running it to view the tables you just made!
@interact(Function = widgets.Dropdown(options = ["tree_type", "seedlings"]))
def the_tables(Function):
if Function == "tree_type":
@interact(Tree = widgets.Dropdown(options = ['Barr', 'Calo', 'Cord', 'Guet', 'Hern']))
def show_me_tree_type(Tree):
tree_type_table(trees, Tree).show()
else:
@interact(Tree = widgets.Dropdown(options = ['Pisona', "Coco"]),
Rows = widgets.BoundedIntText(value = 10, max = 30, min = 1))
def show_me_seedlings(Tree, Rows):
a, b, c = triple_seedling_table(seedlings, Tree)
for df in [a, b]:
df.show(rows = Rows)
This cell is just nicely formatted and supplies all the tables from the data, using only your functions!
(barrs, calos, cords, guets, herns,
(pisonia, pisona_forest, pisona_rats),
(coco, coco_forest, coco_rats)) = (tree_type_table(trees, 'Barr'),
tree_type_table(trees, 'Calo'),
tree_type_table(trees, 'Cord'),
tree_type_table(trees, 'Guet'),
tree_type_table(trees, 'Hern'),
triple_seedling_table(seedlings, "Pisona"),
triple_seedling_table(seedlings, "Coco"))
tree_dict = {"Plot Forests": {"Pisona": pisona_forest, "Coco": coco_forest},
"Plot Rats": {"Barr": barrs, "Calo": calos, "Cord": cords, "Guet": guets, "Hern": herns, "Pisona": pisona_rats, "Coco": coco_rats}}
Use the cell below to view some of the graphs you just made!
@interact(show_all = widgets.ToggleButton(value=False, description='Show All Tables', icon = "eye", button_style = "success"),
Function = widgets.Dropdown(options = ["Plot Forests", "Plot Rats"]))
def the_plots(show_all, Function):
if show_all:
frst_lbl, sdlgs_lbl = iter(['Cocos', "Pisonas"]), iter(["Barrs", "Calos", "Cords", "Guets", "Herns", "Pisonas", "Cocos"])
for data in (coco_forest, pisona_forest):
plot_forest(data, title = next(frst_lbl))
for data in (barrs, calos, cords, guets, herns, pisona_rats, coco_rats):
plot_rats(data, title = next(sdlgs_lbl))
elif Function == "Plot Rats":
@interact(Tree = widgets.Dropdown(options = ['Barr', 'Calo', 'Cord', 'Guet', 'Hern']))
def show_me_rats(Tree):
plot_rats(tree_dict[Function][Tree], title = Tree)
else:
@interact(Tree = widgets.Dropdown(options = ['Pisona', "Coco"]))
def show_me_forest(Tree):
plot_forest(tree_dict[Function][Tree], title = Tree)